tests: posix: stress test for pthread_create and pthread_join #58218

cfriedt · 2023-05-24T04:17:20Z

Recently, a race condition was discovered in pthread_create() and pthread_join() that would trigger an assertion in kernel/sched.c of the form below.

aborted _current back from dead

This was mainly observed in Qemu. Unfortunately, Qemu SMP platforms exhibit a real and measurable "scheduler noise" which makes testing rather difficult.

Additional changes:

revert 7c17bda (it was not a fix)
add a small delay in pthread_descriptor_leak (only reduces the likelihood of scheduler noise impacting the test)

cfriedt · 2023-05-24T14:34:28Z

Related:
#58116

Previous PR (closed, because it was basically a rewrite of pthreads, but did not improve reliability)
#57637

tests/posix/common/src/pthread.c

tests/posix/pthread_pressure/src/main.c

tests/posix/pthread_pressure/testcase.yaml

keith-packard

I just had a few questions before I click the approve button. I don't have any concrete suggestions for changes.

cfriedt · 2023-05-25T23:43:25Z

Rebased on top of several recent RISC-V changes and it is showing definite stability improvements on this test. I'm going to enable it now. I am unfortunately still seeing major issues with qemu_cortex_a53_smp though, so keeping qemu_riscv64_smp as the only integration platform.

cfriedt · 2023-05-25T23:44:07Z

@hakehuang, @keith-packard - PTAL

cfriedt · 2023-05-25T23:51:22Z

Interesting benchmarks as well.

*** Booting Zephyr OS build zephyr-v3.3.0-4448-gd4ff8445c839 ***
Running TESTSUITE pthread_pressure
===================================================================
START - test_k_thread_create_join
BOARD: qemu_riscv64_smp
NUM_THREADS: 2
TEST_NUM_CPUS: 2
TEST_DURATION_S: 29
TEST_DELAY_US: 0
...
now (ms): 29000 end (ms): 29000
Thread 0 created and joined 129865 times
Thread 1 created and joined 129865 times
 PASS - test_k_thread_create_join in 28.999 seconds
===================================================================
START - test_pthread_create_join
BOARD: qemu_riscv64_smp
NUM_THREADS: 2
TEST_NUM_CPUS: 2
TEST_DURATION_S: 29
TEST_DELAY_US: 0
...
now (ms): 58000 end (ms): 58000
Thread 0 created and joined 32818 times
Thread 1 created and joined 32818 times
 PASS - test_pthread_create_join in 28.998 seconds
===================================================================
TESTSUITE pthread_pressure succeeded

------ TESTSUITE SUMMARY START ------

SUITE PASS - 100.00% [pthread_pressure]: pass = 2, fail = 0, skip = 0, total = 2 duration = 57.997 seconds
 - PASS - [pthread_pressure.test_k_thread_create_join] duration = 28.999 seconds
 - PASS - [pthread_pressure.test_pthread_create_join] duration = 28.998 seconds

------ TESTSUITE SUMMARY END ------

===================================================================
PROJECT EXECUTION SUCCESSFUL

cfriedt · 2023-05-26T09:47:46Z

@hakehuang @keith-packard - PTAL again

cfriedt · 2023-05-26T13:37:36Z

Code all looks great, so I decided to take 'er out for a spin. I think you'll need to adjust the test timing to reduce the false negatives.

Thanks - yeah, I always forget that integration_platforms does not mean "only run in CI on these platforms on PR". That could arguably be an improvement, TBH. Honestly, testing non-integration platforms should be more of a nightly thing.

CC @stephanosio - might need your merge superpowers to get past the false positive.

andyross · 2023-05-26T15:03:41Z

@keith-packard note that all tests on SMP platforms get retried twice in CI. There are known glitches within timing between SMP cores that can't be worked around, and that affect some of the naive timing logic in many of the tests (you can have one CPU get halted for 10ms+ because the host OS didn't schedule it, etc...).

So if you're only seeing a handful of failures, that's probably not going to be detectable by CI. Just run twister with "--only-failed" after your run and verify that they succeed reliably the second time (they almost always do, because the multicore host is now less loaded with no builds going on).

FWIW: if the frequency can be measured and it's going up with this PR, though, we should track that somewhere. The kernel bug that was opened is as good a place as any if you have numbers like that.

keith-packard · 2023-05-26T16:28:46Z

@hakehuang @keith-packard - PTAL again

I'm getting a repeatable failure with qemu_x86_64 -- did you intend to exclude that platform as well? Or am I hitting an unexpected problem? It's the 'aborted _current back from dead' trouble in test_pthread_create_join case, ofc.

andyross · 2023-05-26T16:55:02Z

FWIW: @carlocaione found a fix for the kernel bug in #58116. It's just a one-liner you could try right now. I'm working up a slightly more gold plated version, but just one wait_for_switch() call was all that was missing.

keith-packard · 2023-05-26T17:07:42Z

@keith-packard note that all tests on SMP platforms get retried twice in CI. There are known glitches within timing between SMP cores that can't be worked around, and that affect some of the naive timing logic in many of the tests (you can have one CPU get halted for 10ms+ because the host OS didn't schedule it, etc...).

Yeah, I do try to remember to run a second time with -f to see if that's the trouble. In this case, having each of the two internal tests run for 29 seconds while the overall timeout was set at 60 seconds was causing spurious failure more times than not, leading me to be concerned that we'd hit failures even with retries.

FWIW: if the frequency can be measured and it's going up with this PR, though, we should track that somewhere. The kernel bug that was opened is as good a place as any if you have numbers like that.

I'll see if I can find some time to poke at that again.

cfriedt · 2023-05-26T20:23:16Z

I'm getting a repeatable failure with qemu_x86_64 -- did you intend to exclude that platform as well? Or am I hitting an unexpected problem? It's the 'aborted _current back from dead' trouble in test_pthread_create_join case, ofc.

@keith-packard - I could exclude that platform too if it becomes problematic.

IIRC, we have --retry-failed 3 in CI which should allow things to retry.

keith-packard · 2023-05-26T20:52:44Z

I'm getting a repeatable failure with qemu_x86_64 -- did you intend to exclude that platform as well? Or am I hitting an unexpected problem? It's the 'aborted _current back from dead' trouble in test_pthread_create_join case, ofc.

@keith-packard - I could exclude that platform too if it becomes problematic.

IIRC, we have --retry-failed 3 in CI which should allow things to retry.

Yeah, and CI is happy. Ok, I think this is 'good enough' and will let us tweak it to validate any core kernel fixes.

hakehuang

need fix the compliance issue. Others looks good to me

keith-packard · 2023-05-29T08:02:17Z

need fix the compliance issue. Others looks good to me

looks like the compliance issue isn't real? Just using __unused attribute on a variable declaration?

cfriedt · 2023-05-29T12:22:14Z

@stephanosio - you should be ok to merge this I think - PTAL

cfriedt · 2023-05-29T12:23:33Z

@jgl-meta - also, PTAL. I would suggest running this a few times on qemu_riscv64_smp

stephanosio · 2023-05-29T12:59:13Z

tests/posix/pthread_pressure/src/main.c

Instead of adding __unused attribute to int ret, I would suggest using if (IS_ENABLED(...)) instead of #ifdef so that the compiler is aware of the execution path that uses ret.

Yea... I was really on the fence there. In theory, the linker should discard any dead code, and perhaps it's just me being paranoid, but I found that the #ifdef seemed to result in fewer failed assertions.

I'll respin. Ultimately though, I want @jgl-meta to make the call whether this workaround / test is merged first or whether he will hold out for pthreads to be water tight.

The latter is ideal, but there is not always time for the ideal.

Yea... I was really on the fence there. In theory, the linker should discard any dead code, and perhaps it's just me being paranoid, but I found that the #ifdef seemed to result in fewer failed assertions.

Yes, compiler should optimise out any dead code. Linux kernel coding style also recommends using IS_ENABLED instead of preprocessor conditionals whenever possible for this reason -- maybe it is about the time we document this in the Zephyr coding guidelines ...

I'll respin. Ultimately though, I want @jgl-meta to make the call whether this workaround / test is merged first or whether he will hold out for pthreads to be water tight

@cfriedt I think this can go ahead and be merged. I think it is better to have this than make no mention of it at all.

Made @stephanosio 's suggestions.

This reverts commit 7c17bda. Signed-off-by: Christopher Friedt <[email protected]>

The `aborted _current back from dead` error may appear in this particular test under Qemu (and in particular Qemu SMP) systems. A small delay between `pthread_create()` and `pthread_join()` is sufficient to mitigate the issue. Signed-off-by: Christopher Friedt <[email protected]>

Recently, a race condition was discovered in `pthread_create()` and `pthread_join()` that would trigger an assertion in `kernel/sched.c` of the form below. ``` aborted _current back from dead ``` This was mainly observed in Qemu. Unfortunately, Qemu SMP platforms exhibit a real and measurable "scheduler noise" which makes testing rather difficult. Signed-off-by: Christopher Friedt <[email protected]>

cfriedt force-pushed the pthread-pressure branch from 4518ece to 358cbf3 Compare May 24, 2023 14:30

cfriedt marked this pull request as ready for review May 24, 2023 14:31

cfriedt added area: Tests Issues related to a particular existing or missing test area: POSIX POSIX API Library labels May 24, 2023

zephyrbot assigned cfriedt May 24, 2023

cfriedt modified the milestones: v2.7.5, v3.4.0 May 24, 2023

cfriedt requested review from andyross, dleach02, hakehuang, jgl-meta and keith-packard May 24, 2023 14:32

cfriedt requested a review from stephanosio May 24, 2023 14:34

hakehuang previously approved these changes May 24, 2023

View reviewed changes